In this project, we examine the growth in the use of the term “diversity”. To do this, we drew on the MEDLINE database in Web of Science, using the search term “TS=(diversity)” for 1990-2017 and restricting results to human research. This search returned 71,528 results, which we extracted using the bibliometrix package in R. Next, we converted the abstracts of these articles to a text corpus and used tidytext, a package designed for computational text analysis in R, to analyze patterns within the abstracts. Below is the R Markdown file and replication code for these analyses.
In this first chunk, we load our data and examine the overall growth of articles from our search query. As we can see, the scientific literature that uses the term diversity grows substantially, from about 500 articles in 1990 to over 5,000 in 2017.
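# loading the packages used throughout this document
library(tidyverse) # read_csv, dplyr verbs, tidyr, stringr, ggplot2
library(tidytext) # unnest_tokens, stop_words
library(plotly) # ggplotly
library(stringi) # stri_trim_both
# loading the .csv file
text_data <- read_csv("historical_text_data.csv")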
# checking to see how the overall data looks
by_year <- text_data %>%
filter(year != "2018") %>% # filtering 2018 articles because they seem to be incomplete
group_by(year) %>% count(year, sort = TRUE) %>% ungroup()
by_year <- ggplot() + geom_line(aes(y = n, x = year), data = by_year, stat="identity") +
labs(title = "Growth in Diversity-Related Publications from 1990-2017",
caption = "Data Source: Web of Science") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank())
by_year <- ggplotly(by_year); by_year
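To put a number on that growth, the fold change between the first and last years can be computed directly from yearly counts. The sketch below uses a small made-up tibble (`toy_counts`, holding the rounded figures quoted above) rather than the real `text_data`, so the result is illustrative only.

```r
library(dplyr)
library(tibble)

# hypothetical yearly article counts, rounded from the figures quoted in the text
toy_counts <- tibble(year = c(1990, 2017), n = c(500, 5000))

# ratio of the last year's count to the first year's count
fold_change <- toy_counts %>%
  arrange(year) %>%
  summarise(fold = last(n) / first(n)) %>%
  pull(fold)

fold_change
```

Applied to the real yearly counts, the same ratio would need the count table saved under its own name, since `by_year` is overwritten by the plot object in the chunk above.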
Next, we want to look at word frequencies by year in the literature. This chunk of code counts how often each word occurs in the abstracts of our dataset. Note that we also remove some frequently occurring words that are not really relevant to our analysis; removing them does not systematically alter our results.
# tokenizing the abstract data into words
abstract_data <- text_data %>%
unnest_tokens(word, abstract) %>%
anti_join(stop_words)
## Joining, by = "word"
# most frequent word count in abstracts
abstract_data %>%
count(word, sort = TRUE)
## # A tibble: 159,784 x 2
## word n
## <chr> <int>
## 1 diversity 77112
## 2 study 34721
## 3 patients 32787
## 4 human 32166
## 5 results 29482
## 6 1 27696
## 7 genetic 27652
## 8 health 25787
## 9 cell 25463
## 10 analysis 25431
## # ... with 159,774 more rows
# adding custom set of stopwords
my_stopwords <- tibble(word = c(as.character(1:10),
"rights", "reserved", "copyright", "elsevier"))
abstract_data <- abstract_data %>% anti_join(my_stopwords)
## Joining, by = "word"
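The custom list above removes only the tokens "1" through "10". If one wanted to drop every purely numeric token instead, a pattern-based filter is one option. This is a sketch on made-up tokens (`toy_tokens` is hypothetical), and it removes more than the list above does, so counts in the real data would change if it were used.

```r
library(dplyr)
library(stringr)
library(tibble)

# hypothetical tokens standing in for the `word` column of abstract_data
toy_tokens <- tibble(word = c("diversity", "1", "2017", "0.05", "il2", "genetic"))

# drop tokens made up only of digits, optionally with . or , separators
no_numbers <- toy_tokens %>%
  filter(!str_detect(word, "^[0-9.,]+$"))

no_numbers$word
```

Tokens that mix letters and digits, such as gene or protein names, are kept because the pattern requires the whole token to be numeric.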
# looking at word frequencies by year
abstract_words <- abstract_data %>%
filter(year != "2018") %>%
group_by(year) %>%
count(word, sort = TRUE) %>% ungroup(); abstract_words
## # A tibble: 650,713 x 3
## year word n
## <dbl> <chr> <int>
## 1 2017 diversity 6399
## 2 2016 diversity 6154
## 3 2015 diversity 5875
## 4 2014 diversity 5129
## 5 2013 diversity 5036
## 6 2012 diversity 4693
## 7 2011 diversity 4055
## 8 2010 diversity 3794
## 9 2009 diversity 3336
## 10 2017 study 3324
## # ... with 650,703 more rows
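Because the number of articles also grows every year, raw counts partly reflect the size of the corpus. One way to check this is to express each word as a share of that year's tokens. The sketch below assumes a table shaped like `abstract_words` (columns `year`, `word`, `n`); the values in `toy_words` are invented for illustration.

```r
library(dplyr)
library(tibble)

# hypothetical per-year word counts standing in for `abstract_words`
toy_words <- tribble(
  ~year, ~word,       ~n,
  1990,  "diversity",  50,
  1990,  "study",     150,
  2017,  "diversity", 300,
  2017,  "study",     700
)

# share of each year's (non-stopword) tokens accounted for by each word
relative_words <- toy_words %>%
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

diversity_share <- relative_words %>%
  filter(word == "diversity") %>%
  pull(share)

diversity_share
```

In this toy case the raw count rises six-fold while the share moves only from 0.25 to 0.30, which is the kind of gap a normalized view would reveal.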
Now, we want to look at how the most relevant words vary over time. Brandon chose to include words like diversity, genetic, and population as well as racially-specific and ethnically-specific terms. As we see, the rise of diversity does not necessarily mean that the focus on race or ethnicity is growing in step with that term. This could mean that diversity is being used as a catch-all in the scientific literature (i.e. that its many meanings let it stand for anything and everything) or that diversity is most used in fields like immunology or oncology. We will explore these possibilities a bit more below.
diversity_terms <- abstract_words %>%
filter(year != "2018") %>%
filter(word == "diversity" | word == 'genetic' | word == "population" |
word == "ethnic" | word == "racial" | word == 'race' |
word == 'caucasian' | word == 'african' | word == 'black')
diversity_terms
## # A tibble: 252 x 3
## year word n
## <dbl> <chr> <int>
## 1 2017 diversity 6399
## 2 2016 diversity 6154
## 3 2015 diversity 5875
## 4 2014 diversity 5129
## 5 2013 diversity 5036
## 6 2012 diversity 4693
## 7 2011 diversity 4055
## 8 2010 diversity 3794
## 9 2009 diversity 3336
## 10 2008 diversity 2896
## # ... with 242 more rows
word_graph <- ggplot() + geom_line(aes(y = n, x = year, colour = word),
data = diversity_terms, stat="identity") +
labs(title = "Growth in Diversity-Related Terms (1990-2017)") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())
interactive_graph <- ggplotly(word_graph); interactive_graph
While that was quite useful, we do not see much change in the use of race/ethnicity terms over time. One possibility is that variation in those terms is “diluted” across the many different terms that researchers use. Thus, the next logical step is to collapse all of the population-specific terms into one category and compare that to other terminology over time. Lee (2009), Kramer (2019), and others (e.g. Panofsky and Bliss 2017) have found that biomedical researchers continue to use population-specific terminology to reinforce the notion of population differences in various biological markers. Below, we have collapsed several population-specific terms into one category, based on a list of terms developed out of Kramer’s (2019) dissertation work. While we won’t claim that this is an exhaustive list of all the population terms that exist in the world, it is a fairly comprehensive list of over 2,100 terms (also see below in text networks section). Let’s take a look at how much the use of all these population-specific terms grew over time…
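population_specific <- read_csv("population_terms.csv")
# collapse all terms into one case-insensitive, word-bounded alternation;
# "zcx" and "zxc" are dummy strings that only open and close the group
population_specific <- paste(c("\\b(?i)(zcx", population_specific$term, "zxc)\\b"), collapse = "|")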
recoded_abstract_data <- abstract_data %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = population_specific),
yes = "population-specific", no = word))
recoded_abstract_data %>% filter(recoded_word == "population-specific")
## # A tibble: 120,800 x 14
## id author title publication year department subject grant_informati~
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 4 DUBOI~ LONG~ AIDS (LOND~ 1999 UNIVERSIT~ IMMUNO~ <NA>
## 2 4 DUBOI~ LONG~ AIDS (LOND~ 1999 UNIVERSIT~ IMMUNO~ <NA>
## 3 15 OUBIN~ GENE~ VIRUS RESE~ 1999 LABORATOR~ IMMUNO~ <NA>
## 4 15 OUBIN~ GENE~ VIRUS RESE~ 1999 LABORATOR~ IMMUNO~ <NA>
## 5 16 MONTA~ THE ~ AIDS RESEA~ 1999 LABORATOI~ GENETI~ <NA>
## 6 16 MONTA~ THE ~ AIDS RESEA~ 1999 LABORATOI~ GENETI~ <NA>
## 7 16 MONTA~ THE ~ AIDS RESEA~ 1999 LABORATOI~ GENETI~ <NA>
## 8 16 MONTA~ THE ~ AIDS RESEA~ 1999 LABORATOI~ GENETI~ <NA>
## 9 16 MONTA~ THE ~ AIDS RESEA~ 1999 LABORATOI~ GENETI~ <NA>
## 10 16 MONTA~ THE ~ AIDS RESEA~ 1999 LABORATOI~ GENETI~ <NA>
## # ... with 120,790 more rows, and 6 more variables: keyword <chr>,
## # pubmed_id <chr>, doi <chr>, country <chr>, word <chr>,
## # recoded_word <chr>
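To make the pattern construction above easier to follow, here is the same `paste()` call applied to three made-up terms. The contents of `population_terms.csv` are not shown in this document, so `toy_terms` is purely hypothetical.

```r
library(stringr)

# hypothetical stand-ins for the `term` column of population_terms.csv
toy_terms <- c("african", "asian", "north america")

# same construction as above: dummy strings open and close the group
toy_pattern <- paste(c("\\b(?i)(zcx", toy_terms, "zxc)\\b"), collapse = "|")
toy_pattern

# single-word tokens, as produced by unnest_tokens()
toy_tokens <- c("african", "africans", "Asian", "north", "america")
matches <- str_detect(toy_tokens, toy_pattern)
matches
```

Two things stand out in this toy case. The word boundaries mean "africans" does not match "african", so plural forms count only if they are listed as terms themselves. And if the real list contains multi-word entries (an assumption on our part), they cannot match here, because each token is a single word.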
recoded_abstract_words <- recoded_abstract_data %>%
filter(year != "2018") %>%
group_by(year) %>%
count(recoded_word, sort = TRUE) %>% ungroup(); recoded_abstract_words
## # A tibble: 642,447 x 3
## year recoded_word n
## <dbl> <chr> <int>
## 1 2016 population-specific 9308
## 2 2017 population-specific 9271
## 3 2015 population-specific 9208
## 4 2013 population-specific 7809
## 5 2014 population-specific 7733
## 6 2012 population-specific 7036
## 7 2011 population-specific 6484
## 8 2010 population-specific 6434
## 9 2017 diversity 6399
## 10 2016 diversity 6154
## # ... with 642,437 more rows
diversity_terms <- recoded_abstract_words %>%
filter(recoded_word == "diversity" | recoded_word == "genetic" |
recoded_word == "population" | recoded_word == "population-specific")
diversity_terms
## # A tibble: 112 x 3
## year recoded_word n
## <dbl> <chr> <int>
## 1 2016 population-specific 9308
## 2 2017 population-specific 9271
## 3 2015 population-specific 9208
## 4 2013 population-specific 7809
## 5 2014 population-specific 7733
## 6 2012 population-specific 7036
## 7 2011 population-specific 6484
## 8 2010 population-specific 6434
## 9 2017 diversity 6399
## 10 2016 diversity 6154
## # ... with 102 more rows
word_graph <- ggplot() + geom_line(aes(y = n, x = year, colour = recoded_word),
data = diversity_terms, stat="identity") +
labs(title = "Growth in Diversity-Related Terminology (1990-2017)") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())
interactive_graph <- ggplotly(word_graph); interactive_graph
Interesting! After we collapsed all of the terms together in the population-specific category, we see a dramatic growth that goes well beyond the growth in diversity. Now, let’s see what the growth of these terms looks like when we break it down by general, national, continental, and ethnic groups across various geographies.
# read the term list once and reuse it for every category
population_terms <- read_csv("population_terms.csv")
general_pop_terms <- population_terms %>% filter(sub_category == "population_general")
us_specific_terms <- population_terms %>% filter(sub_category == "us_specific")
continental_terms <- population_terms %>%
filter(category == "continental" | category == "subcontinental")
ling_religious_terms <- population_terms %>% filter(category == "linguistic_religious")
national_terms <- population_terms %>% filter(category == "national")
south_american_ethnic_groups <- population_terms %>% filter(category == "south_america")
african_ethnic_groups <- population_terms %>% filter(category == "africa")
north_american_ethnic_groups <- population_terms %>% filter(category == "north_america")
european_ethnic_groups <- population_terms %>% filter(category == "europe")
asian_ethnic_groups <- population_terms %>% filter(category == "asia")
all_ethnic_groups <- population_terms %>%
filter(category == "south_america" | category == "africa" |
category == "asia" | category == "north_america" | category == "europe")
general_pop_terms <- paste(c("\\b(?i)(zxz", general_pop_terms$term, "zxz)\\b"), collapse = "|")
us_specific_terms <- paste(c("\\b(?i)(zxz", us_specific_terms$term, "zxz)\\b"), collapse = "|")
continental_terms <- paste(c("\\b(?i)(zxz", continental_terms$term, "zxz)\\b"), collapse = "|")
ling_religious_terms <- paste(c("\\b(?i)(zxz", ling_religious_terms$term, "zxz)\\b"), collapse = "|")
national_terms <- paste(c("\\b(?i)(zxz", national_terms$term, "zxz)\\b"), collapse = "|")
south_american_ethnic_groups <- paste(c("\\b(?i)(zxz", south_american_ethnic_groups$term, "zxz)\\b"), collapse = "|")
african_ethnic_groups <- paste(c("\\b(?i)(zxz", african_ethnic_groups$term, "zxz)\\b"), collapse = "|")
north_american_ethnic_groups <- paste(c("\\b(?i)(zxz", north_american_ethnic_groups$term, "zxz)\\b"), collapse = "|")
european_ethnic_groups <- paste(c("\\b(?i)(zxz", european_ethnic_groups$term, "zxz)\\b"), collapse = "|")
asian_ethnic_groups <- paste(c("\\b(?i)(zxz", asian_ethnic_groups$term, "zxz)\\b"), collapse = "|")
all_ethnic_groups <- paste(c("\\b(?i)(zxz", all_ethnic_groups$term, "zxz)\\b"), collapse = "|")
recoded_abstract_data <- abstract_data %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = general_pop_terms),
yes = "general population terms", no = word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = us_specific_terms),
yes = "us-specific terms", no = recoded_word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = continental_terms),
yes = "continental terms", no = recoded_word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = ling_religious_terms),
yes = "linguistic & religious terms", no = recoded_word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = national_terms),
yes = "national terms", no = recoded_word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = south_american_ethnic_groups),
yes = "south american ethnic groups", no = recoded_word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = african_ethnic_groups),
yes = "african ethnic groups", no = recoded_word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = north_american_ethnic_groups),
yes = "north american ethnic groups", no = recoded_word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = european_ethnic_groups),
yes = "european ethnic groups", no = recoded_word)) %>%
mutate(recoded_word = ifelse(test = str_detect(string = abstract_data$word, pattern = asian_ethnic_groups),
yes = "asian ethnic groups", no = recoded_word))
recoded_abstract_words <- recoded_abstract_data %>%
filter(year != "2018") %>%
group_by(year) %>%
count(recoded_word, sort = TRUE) %>% ungroup(); recoded_abstract_words
## # A tibble: 642,695 x 3
## year recoded_word n
## <dbl> <chr> <int>
## 1 2017 diversity 6399
## 2 2016 diversity 6154
## 3 2015 diversity 5875
## 4 2014 diversity 5129
## 5 2013 diversity 5036
## 6 2012 diversity 4693
## 7 2015 national terms 4406
## 8 2016 national terms 4358
## 9 2017 national terms 4291
## 10 2011 diversity 4055
## # ... with 642,685 more rows
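The chain of `ifelse()` calls above recodes tokens one category at a time, with later categories overwriting earlier ones. An alternative, sketched here on made-up data, is to recode by exact match with a lookup table and a join. This is not the method used above: it matches whole tokens only, and the `toy_lookup` table and its `label` column are hypothetical.

```r
library(dplyr)
library(tibble)

# hypothetical excerpt of a term table with one label per term
toy_lookup <- tribble(
  ~term,      ~label,
  "italian",  "national terms",
  "japanese", "national terms",
  "africa",   "continental terms",
  "race",     "general population terms"
)

# hypothetical tokens standing in for abstract_data$word
toy_tokens <- tibble(word = c("italian", "diversity", "africa", "race", "cell"))

# exact-match recode: tokens not in the lookup keep their original word
toy_recoded <- toy_tokens %>%
  left_join(toy_lookup, by = c("word" = "term")) %>%
  mutate(recoded_word = coalesce(label, word)) %>%
  select(-label)

toy_recoded$recoded_word
```

One caveat with this design: if a term were listed under more than one category, the join would duplicate that token's rows, whereas the chained version silently keeps the last matching category.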
diversity_terms <- recoded_abstract_words %>%
filter(recoded_word == "diversity" | recoded_word == "population" | recoded_word == "population-specific" |
recoded_word == "general population terms" | recoded_word == "continental terms" | recoded_word == "linguistic & religious terms" |
recoded_word == "national terms" | recoded_word == "south american ethnic groups" | recoded_word == "african ethnic groups" |
recoded_word == "north american ethnic groups" | recoded_word == "european ethnic groups" | recoded_word == "asian ethnic groups" |
recoded_word == "us-specific terms")
diversity_terms
## # A tibble: 332 x 3
## year recoded_word n
## <dbl> <chr> <int>
## 1 2017 diversity 6399
## 2 2016 diversity 6154
## 3 2015 diversity 5875
## 4 2014 diversity 5129
## 5 2013 diversity 5036
## 6 2012 diversity 4693
## 7 2015 national terms 4406
## 8 2016 national terms 4358
## 9 2017 national terms 4291
## 10 2011 diversity 4055
## # ... with 322 more rows
word_graph <- ggplot() + geom_line(aes(y = n, x = year, colour = recoded_word),
data = diversity_terms, stat="identity") +
labs(title = "Growth in Diversity-Related Terminology (1990-2017)") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())
interactive_graph <- ggplotly(word_graph); interactive_graph
As we can see here, the majority of growth in population-specific terms is because of an increased proclivity to mention nationalities (e.g. Italian or Japanese), continental terms (e.g. africa, asia, north america), and general population terms (e.g. race, ethnicity, caucasian, african, asian, etc). We do not really see much growth in ethnic groups in any given continent even when we lump all of the different ethnic groups into one category (not shown here).
The next step is to look at how the concept of “diversity” is used across the world. As the graph below demonstrates, the rise of “diversity” seems to occur mostly in the context of predominantly White, Westernized countries like the US, England, the Netherlands, Germany and Switzerland.
# here we are just converting everything to lower case
abstract_data$country <- tolower(abstract_data$country)
# looking at word frequencies by year
diversity_by_country <- abstract_data %>%
filter(year != "2018") %>%
group_by(year) %>%
count(word, country, sort = TRUE) %>% ungroup(); diversity_by_country
## # A tibble: 1,424,517 x 4
## year word country n
## <dbl> <chr> <chr> <int>
## 1 2017 diversity united states 2781
## 2 2016 diversity united states 2753
## 3 2015 diversity united states 2635
## 4 2014 diversity united states 2464
## 5 2013 diversity united states 2401
## 6 2012 diversity united states 2324
## 7 2017 diversity england 2261
## 8 2016 diversity england 1982
## 9 2011 diversity united states 1919
## 10 2015 diversity england 1894
## # ... with 1,424,507 more rows
diversity_by_country <- diversity_by_country %>%
filter(word == "diversity")
diversity_by_country
## # A tibble: 1,001 x 4
## year word country n
## <dbl> <chr> <chr> <int>
## 1 2017 diversity united states 2781
## 2 2016 diversity united states 2753
## 3 2015 diversity united states 2635
## 4 2014 diversity united states 2464
## 5 2013 diversity united states 2401
## 6 2012 diversity united states 2324
## 7 2017 diversity england 2261
## 8 2016 diversity england 1982
## 9 2011 diversity united states 1919
## 10 2015 diversity england 1894
## # ... with 991 more rows
diversity_over_time <- ggplot() + geom_line(aes(y = n, x = year, colour = country),
data = diversity_by_country, stat="identity") +
labs(title = "Growth in Diversity-Related Terminology (From 1990-2017, By Country)") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())
diversity_over_time <- ggplotly(diversity_over_time); diversity_over_time
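With one line per country the plot can get crowded. A possible refinement is to keep only the countries with the largest totals before plotting. The sketch below assumes a table shaped like `diversity_by_country`; the United States and England rows reuse counts shown above, while "exampleland" and the cutoff of two countries are made up for illustration.

```r
library(dplyr)
library(tibble)

# small stand-in for diversity_by_country ("exampleland" is hypothetical)
toy_country <- tribble(
  ~year, ~country,        ~n,
  2016,  "united states", 2753,
  2017,  "united states", 2781,
  2016,  "england",       1982,
  2017,  "england",       2261,
  2017,  "exampleland",     12
)

# countries ranked by their total count across all years
top_countries <- toy_country %>%
  group_by(country) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total)) %>%
  head(2) %>%
  pull(country)

# keep only rows for the top-ranked countries
toy_top <- toy_country %>% filter(country %in% top_countries)
toy_top
```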
Lastly, we wanted to look more into the growth of diversity-related terms by scientific subject matter. This snippet of code counts the words occurring in abstracts over time, broken down by MEDLINE’s and Web of Science’s subject categories. I have opted to include only 13 of the 134 categories that could have been graphed here. Overall, we see the rise of diversity in genetics & heredity, biochemistry & molecular biology, microbiology, immunology, and infectious disease research. We do not see this same rise in the social sciences, though there admittedly is some overall growth in that domain.
# first we need to break apart all the subject categories for each paper
text_data <- text_data %>%
separate(subject, into = paste("subject", 1:15, sep = "_"), sep = ";") %>%
gather(value, subject, subject_1:subject_15, na.rm = TRUE) %>% select(-value)
# and remove the annoying parentheticals from some categories
text_data <- text_data %>%
separate(subject, into = c("subject", "void"), sep = "[(]") %>% select(-void)
# then we get rid of the extra white space and make everything lower case to standardize
text_data$subject <- stri_trim_both(text_data$subject)
text_data$subject <- tolower(text_data$subject)
# this shows 134 distinct subject strings with no duplicates; note that entries 109 and 126
# appear to be one category that was split at the ";" inside its "&apos;" entity
unique(text_data$subject)
## [1] "biochemistry & molecular biology"
## [2] "genetics & heredity"
## [3] "immunology"
## [4] "pharmacology & pharmacy"
## [5] "biophysics"
## [6] "pediatrics"
## [7] "cardiovascular system & cardiology"
## [8] "microbiology"
## [9] "infectious diseases"
## [10] "cell biology"
## [11] "evolutionary biology"
## [12] "medical ethics"
## [13] "nursing"
## [14] "anthropology"
## [15] "toxicology"
## [16] "hematology"
## [17] "psychiatry"
## [18] "psychology"
## [19] "geriatrics & gerontology"
## [20] "ethnic studies"
## [21] "oncology"
## [22] "zoology"
## [23] "health care sciences & services"
## [24] "pathology"
## [25] "behavioral sciences"
## [26] "mathematics"
## [27] "cultural studies"
## [28] "research & experimental medicine"
## [29] "education & educational research"
## [30] "medical informatics"
## [31] "information science & library science"
## [32] "dentistry, oral surgery & medicine"
## [33] "neurosciences & neurology"
## [34] "dermatology"
## [35] "physiology"
## [36] "fisheries"
## [37] "nutrition & dietetics"
## [38] "environmental sciences & ecology"
## [39] "computer science"
## [40] "demography"
## [41] "entomology"
## [42] "gastroenterology & hepatology"
## [43] "general & internal medicine"
## [44] "parasitology"
## [45] "otorhinolaryngology"
## [46] "respiratory system"
## [47] "virology"
## [48] "communication"
## [49] "public, environmental & occupational health"
## [50] "meteorology & atmospheric sciences"
## [51] "orthopedics"
## [52] "medical laboratory technology"
## [53] "business & economics"
## [54] "history"
## [55] "surgery"
## [56] "sociology"
## [57] "anatomy & morphology"
## [58] "ophthalmology"
## [59] "agriculture"
## [60] "urology & nephrology"
## [61] "legal medicine"
## [62] "food science & technology"
## [63] "biotechnology & applied microbiology"
## [64] "religion"
## [65] "criminology & penology"
## [66] "endocrinology & metabolism"
## [67] "philosophy"
## [68] "developmental biology"
## [69] "archaeology"
## [70] "audiology & speech-language pathology"
## [71] "rheumatology"
## [72] "anesthesiology"
## [73] "government & law"
## [74] "allergy"
## [75] "materials science"
## [76] "social issues"
## [77] "microscopy"
## [78] "obstetrics & gynecology"
## [79] "substance abuse"
## [80] "reproductive biology"
## [81] "chemistry"
## [82] "radiology, nuclear medicine & medical imaging"
## [83] "integrative & complementary medicine"
## [84] "biodiversity & conservation"
## [85] "veterinary sciences"
## [86] "transplantation"
## [87] "imaging science & photographic technology"
## [88] "plant sciences"
## [89] "nuclear science & technology"
## [90] "social sciences - other topics"
## [91] "family studies"
## [92] "mycology"
## [93] "life sciences & biomedicine - other topics"
## [94] "acoustics"
## [95] "international relations"
## [96] "physics"
## [97] "rehabilitation"
## [98] "critical care medicine"
## [99] "robotics"
## [100] "engineering"
## [101] "geography"
## [102] "emergency medicine"
## [103] "tropical medicine"
## [104] "music"
## [105] "electrochemistry"
## [106] "energy & fuels"
## [107] "architecture"
## [108] "automation & control systems"
## [109] "women&apos"
## [110] "art"
## [111] "sport sciences"
## [112] "paleontology"
## [113] "science & technology - other topics"
## [114] "astronomy & astrophysics"
## [115] "linguistics"
## [116] "urban studies"
## [117] "film, radio & television"
## [118] "telecommunications"
## [119] "forestry"
## [120] "mining & mineral processing"
## [121] "optics"
## [122] "arts & humanities - other topics"
## [123] "history & philosophy of science"
## [124] "thermodynamics"
## [125] "literature"
## [126] "s studies"
## [127] "marine & freshwater biology"
## [128] "social work"
## [129] "geology"
## [130] "operations research & management science"
## [131] "theater"
## [132] "oceanography"
## [133] "metallurgy & metallurgical engineering"
## [134] "water resources"
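Entries [109] and [126] in the list above ("women&apos" and "s studies") look like a single category that was split at the semicolon inside an HTML entity. Assuming the raw export encodes the apostrophe as `&apos;` (an inference from those two entries, not something we have checked against the raw file), decoding the entity before splitting would keep the category intact. A base R sketch on a made-up string:

```r
# hypothetical raw subject string in the style of the export
raw_subject <- "psychology; women&apos;s studies; sociology"

# splitting first breaks the entity apart at its semicolon
split_first <- trimws(strsplit(raw_subject, ";", fixed = TRUE)[[1]])
split_first

# decoding the entity first keeps the category in one piece
decoded <- gsub("&apos;", "'", raw_subject, fixed = TRUE)
split_after <- trimws(strsplit(decoded, ";", fixed = TRUE)[[1]])
split_after
```

In the pipeline above, the equivalent step would be a replacement on the `subject` column before the call to `separate()`, which would also bring the subject count down by one.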
# now we can see how often these words arise by subject
subject_data <- text_data %>%
unnest_tokens(word, abstract) %>%
anti_join(stop_words)
growth_by_subject <- subject_data %>%
filter(year != "2018") %>%
group_by(year) %>%
count(word, subject, sort = TRUE) %>% ungroup()
subject_data %>%
count(subject, sort = TRUE)
## # A tibble: 134 x 2
## subject n
## <chr> <int>
## 1 genetics & heredity 4213638
## 2 biochemistry & molecular biology 4016585
## 3 microbiology 2216691
## 4 immunology 1986325
## 5 infectious diseases 1857374
## 6 pharmacology & pharmacy 1466127
## 7 behavioral sciences 1464061
## 8 psychology 1356911
## 9 pediatrics 1344591
## 10 cell biology 1321172
## # ... with 124 more rows
graph_by_subject <- growth_by_subject %>%
filter(word == "diversity") %>%
filter(subject == "genetics & heredity" | subject == "biochemistry & molecular biology" |
subject == "microbiology" | subject == "infectious diseases" | subject == "immunology" |
subject == "pharmacology & pharmacy" | subject == "behavioral sciences" |
subject == "health care sciences & services" | subject == "neurosciences & neurology" |
subject == "psychology" | subject == "sociology" |
subject == "oncology" | subject == "business & economics"
)
graph_by_subject <- ggplot() + geom_line(aes(y = n, x = year, colour = subject),
data = graph_by_subject, stat="identity") +
labs(title = "Growth in Diversity-Related Terminology (From 1990-2017, By Subject)") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())
graph_by_subject <- ggplotly(graph_by_subject); graph_by_subject
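The long chain of `subject == ... |` comparisons above can also be written with `%in%` and a vector of subject names, which makes the list of plotted subjects easier to edit. A sketch on made-up rows (`toy_growth` is hypothetical and only mimics the columns of `growth_by_subject`):

```r
library(dplyr)
library(tibble)

# subjects to keep; shortened here for illustration
selected_subjects <- c("genetics & heredity", "immunology", "sociology")

# hypothetical rows shaped like growth_by_subject
toy_growth <- tribble(
  ~year, ~word,       ~subject,              ~n,
  2017,  "diversity", "genetics & heredity", 10,
  2017,  "diversity", "robotics",             1,
  2017,  "study",     "immunology",           5,
  2017,  "diversity", "sociology",            2
)

# keep "diversity" rows whose subject is in the selected set
toy_selected <- toy_growth %>%
  filter(word == "diversity", subject %in% selected_subjects)

toy_selected$subject
```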
Instead of selecting particular subjects, we also wanted to know what would happen if we just collapsed the categories into three buckets: (1) genetics & heredity (i.e. evolution, plant and animal sciences), (2) biomedical studies, and (3) the social and behavioral sciences. As a general side note, when one of the original categories did not fit into these three recoded categories (e.g. mathematics or computer science), it was recoded as “other” and ignored in our final analysis. For a full list of how the original 134 categories were recoded, you can visit this link.
# read the recoding table once and reuse it for every bucket
subject_categories <- read_csv("subject_categories.csv")
genetics_heredity <- subject_categories %>% filter(collapsed_subject == "genetics_heredity")
biomedical_studies <- subject_categories %>% filter(collapsed_subject == "biomedical_studies")
social_behavioral <- subject_categories %>% filter(collapsed_subject == "social_behavioral")
other_studies <- subject_categories %>% filter(collapsed_subject == "other")
genetics_heredity <- paste(c("\\b(?i)(zxz", genetics_heredity$original_subject, "zxz)\\b"), collapse = "|")
biomedical_studies <- paste(c("\\b(?i)(zxz", biomedical_studies$original_subject, "zxz)\\b"), collapse = "|")
social_behavioral <- paste(c("\\b(?i)(zxz", social_behavioral$original_subject, "zxz)\\b"), collapse = "|")
other_studies <- paste(c("\\b(?i)(zxz", other_studies$original_subject, "zxz)\\b"), collapse = "|")
recoded_subject_data <- subject_data %>%
select(subject, word, year) %>%
mutate(recoded_subject = ifelse(test = str_detect(string = subject_data$subject, pattern = genetics_heredity),
yes = "genetics & heredity", no = subject)) %>%
mutate(recoded_subject = ifelse(test = str_detect(string = subject_data$subject, pattern = biomedical_studies),
yes = "biomedical studies", no = recoded_subject)) %>%
mutate(recoded_subject = ifelse(test = str_detect(string = subject_data$subject, pattern = social_behavioral),
yes = "social & behavioral sciences", no = recoded_subject)) %>%
mutate(recoded_subject = ifelse(test = str_detect(string = subject_data$subject, pattern = other_studies),
yes = "other subjects", no = recoded_subject))
growth_by_subject <- recoded_subject_data %>%
filter(year != "2018") %>%
group_by(year) %>%
count(word, recoded_subject, sort = TRUE) %>% ungroup()
graph_by_subject <- growth_by_subject %>%
filter(word == "diversity")
graph_by_subject <- ggplot() + geom_line(aes(y = n, x = year, colour = recoded_subject),
data = graph_by_subject, stat="identity") +
labs(title = "Growth in Diversity-Related Terminology (From 1990-2017, By Subject)") +
theme(axis.title.x = element_blank(), axis.title.y = element_blank(), legend.title = element_blank())
interactive_by_subject <- ggplotly(graph_by_subject); interactive_by_subject
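One thing to keep in mind with the regex-based recoding above is that `str_detect()` matches substrings, so a short subject name can also match inside a longer one. The sketch below builds a one-subject pattern the same way as above and compares it with an exact comparison. The two subject names come from the list printed earlier; how they are assigned in `subject_categories.csv` is not shown here, so whether this affects the results is an open question.

```r
library(stringr)

# one-subject pattern built the same way as the patterns above
toy_pattern <- paste(c("\\b(?i)(zxz", "microbiology", "zxz)\\b"), collapse = "|")

toy_subjects <- c("microbiology", "biotechnology & applied microbiology")

# substring matching flags both subjects
partial <- str_detect(toy_subjects, toy_pattern)
partial

# an exact comparison flags only the identical subject
exact <- toy_subjects %in% c("microbiology")
exact
```

If the two subjects belonged to different buckets, the bucket applied last in the `mutate()` chain would win for the longer name, so an exact match (or a join on `original_subject`) may be a safer way to recode.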
Perhaps this is just a by-product of more categories being biomedically related, but this analysis suggests that diversity is growing faster in biomedical studies than in the social & behavioral sciences or in genetics & heredity-related studies.
Overall, this document shows a rise in the use of “diversity” across scientific research. We see a roughly 10-fold increase between 1990 and 2017, which mostly occurs in research deriving from Westernized biomedical scientific contexts. Our future analyses will examine what implications this has for the use of diversity in and outside of that domain.
Kramer, B. L. (2019). Molecularization at the intersections: testosterone, prostate cancer and the construction of racial difference. Doctoral dissertation, Rutgers University-School of Graduate Studies.
Lee, C. (2009). “Race” and “ethnicity” in biomedical research: how do scientists construct and explain differences in health?. Social Science & Medicine, 68(6), 1183-1190.
Panofsky, A., & Bliss, C. (2017). Ambiguity and scientific authority: population classification in genomic science. American Sociological Review, 82(1), 59-87.